0.0.0.0.1 Our added answers/comments are always in italic!

1 Where Do People Drink The Most Beer, Wine And Spirits?

library(fivethirtyeight)
data(drinks)


# or download directly
alcohol_direct <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/alcohol-consumption/drinks.csv")
  • What are the variable types? Any missing values we should worry about?
skim(alcohol_direct)
Data summary
Name alcohol_direct
Number of rows 193
Number of columns 5
_______________________
Column type frequency:
character 1
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
country 0 1 3 28 0 193 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
beer_servings 0 1 106.16 101.14 0 20.0 76.0 188.0 376.0 ▇▃▂▂▁
spirit_servings 0 1 80.99 88.28 0 4.0 56.0 128.0 438.0 ▇▃▂▁▁
wine_servings 0 1 49.45 79.70 0 1.0 8.0 59.0 370.0 ▇▁▁▁▁
total_litres_of_pure_alcohol 0 1 4.72 3.77 0 1.3 4.2 7.2 14.4 ▇▃▅▃▁

As the report on n_missing in the above output shows, there don’t seem to be any missing values for any variable in the data set. Moreover, we can observe that the only non-numeric variable is country, while the four remaining are numeric.


  • Make a plot that shows the top 25 beer consuming countries
# YOUR CODE GOES HERE
beer_consumtion <- alcohol_direct %>%
  slice_max(order_by = beer_servings, n = 25)

ggplot(beer_consumtion, aes(y=fct_reorder(country,beer_servings), x=beer_servings))+
  geom_col(fill = "orange")+
  theme_bw(  )+
  labs(title = "Beer Consumption", subtitle = "The top beer consuming countries", x="Annual Servings", y="Country")

  • Make a plot that shows the top 25 wine consuming countries
wine_consumption <- alcohol_direct %>%
  slice_max(order_by = wine_servings, n = 25)

ggplot(wine_consumption, aes(y=fct_reorder(country,wine_servings), x=wine_servings))+
  geom_col(fill = "darkred")+
  theme_bw(  )+
  labs(title = "Wine Consumption", subtitle = "The top wine consuming countries", x="Annual Servings", y="Country")

  • Finally, make a plot that shows the top 25 spirit consuming countries
# YOUR CODE GOES HERE
spirit_consumtion <- alcohol_direct %>%
  slice_max(order_by = spirit_servings, n = 25)

ggplot(spirit_consumtion, aes(y=fct_reorder(country,spirit_servings), x=spirit_servings))+
  geom_col(fill = "lightblue")+
  theme_bw(  )+
  labs(title = "Spirit Consumption", subtitle = "The top spirit consuming countries", x="Annual Servings", y="Country")

  • What can you infer from these plots? Don’t just explain what’s in the graph, but speculate or tell a short story (1-2 paragraphs max).

In general, there tends to be no overlap among countries with regards to the top spots in alcohol consumption, i.e. the top beer consuming countries are not the top wine consuming countries. It is surprising that Germany is only ranked #4 in the top beer consuming countries, after Namibia, Czech Republic and Gabon. A quick research on the reason for Namibia’s high beer consumption hint at its former colonial ties with Germany (see here. Also, one can observe that the top wine consuming countries are almost exclusively European, which might be due to the reason that wine is culturally rooted in these countries and many of them are also known for their production of wine.


2 Analysis of movies- IMDB dataset

movies <- read_csv(here::here("data", "movies.csv"))
glimpse(movies)
## Rows: 2,961
## Columns: 11
## $ title               <chr> "Avatar", "Titanic", "Jurassic World", "The Avenge…
## $ genre               <chr> "Action", "Drama", "Action", "Action", "Action", "…
## $ director            <chr> "James Cameron", "James Cameron", "Colin Trevorrow…
## $ year                <dbl> 2009, 1997, 2015, 2012, 2008, 1999, 1977, 2015, 20…
## $ duration            <dbl> 178, 194, 124, 173, 152, 136, 125, 141, 164, 93, 1…
## $ gross               <dbl> 7.61e+08, 6.59e+08, 6.52e+08, 6.23e+08, 5.33e+08, …
## $ budget              <dbl> 2.37e+08, 2.00e+08, 1.50e+08, 2.20e+08, 1.85e+08, …
## $ cast_facebook_likes <dbl> 4834, 45223, 8458, 87697, 57802, 37723, 13485, 920…
## $ votes               <dbl> 886204, 793059, 418214, 995415, 1676169, 534658, 9…
## $ reviews             <dbl> 3777, 2843, 1934, 2425, 5312, 3917, 1752, 1752, 35…
## $ rating              <dbl> 7.9, 7.7, 7.0, 8.1, 9.0, 6.5, 8.7, 7.5, 8.5, 7.2, …
  • Are there any missing values (NAs)? Are all entries distinct or are there duplicate entries?
skim(movies)
Data summary
Name movies
Number of rows 2961
Number of columns 11
_______________________
Column type frequency:
character 3
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
title 0 1 1 83 0 2907 0
genre 0 1 5 11 0 17 0
director 0 1 3 32 0 1366 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2.00e+03 9.95e+00 1920.0 2.00e+03 2.00e+03 2.01e+03 2.02e+03 ▁▁▁▂▇
duration 0 1 1.10e+02 2.22e+01 37.0 9.50e+01 1.06e+02 1.19e+02 3.30e+02 ▃▇▁▁▁
gross 0 1 5.81e+07 7.25e+07 703.0 1.23e+07 3.47e+07 7.56e+07 7.61e+08 ▇▁▁▁▁
budget 0 1 4.06e+07 4.37e+07 218.0 1.10e+07 2.60e+07 5.50e+07 3.00e+08 ▇▂▁▁▁
cast_facebook_likes 0 1 1.24e+04 2.05e+04 0.0 2.24e+03 4.60e+03 1.69e+04 6.57e+05 ▇▁▁▁▁
votes 0 1 1.09e+05 1.58e+05 5.0 1.99e+04 5.57e+04 1.33e+05 1.69e+06 ▇▁▁▁▁
reviews 0 1 5.03e+02 4.94e+02 2.0 1.99e+02 3.64e+02 6.31e+02 5.31e+03 ▇▁▁▁▁
rating 0 1 6.39e+00 1.05e+00 1.6 5.80e+00 6.50e+00 7.10e+00 9.30e+00 ▁▁▆▇▁

There are no missing values as can be observed when analysing n_missing, however, there are duplicate values for some variables. What is especially interesting, is that there are duplicate Titles for movies. This can be observed through looking at n_unique: Even though there are a total of 2961 records, there only seem to be 2907 unique movie titles.


  • Produce a table with the count of movies by genre, ranked in descending order
count_moviesByGenre<-movies%>%
  group_by(genre)%>%
  count(sort=TRUE)%>%
  rename("number_of_movies" = n)

count_moviesByGenre
## # A tibble: 17 × 2
## # Groups:   genre [17]
##    genre       number_of_movies
##    <chr>                  <int>
##  1 Comedy                   848
##  2 Action                   738
##  3 Drama                    498
##  4 Adventure                288
##  5 Crime                    202
##  6 Biography                135
##  7 Horror                   131
##  8 Animation                 35
##  9 Fantasy                   28
## 10 Documentary               25
## 11 Mystery                   16
## 12 Sci-Fi                     7
## 13 Family                     3
## 14 Musical                    2
## 15 Romance                    2
## 16 Western                    2
## 17 Thriller                   1

There seems to be quite a large difference in the number of records across genres in this dataset, as there is information on 848 Comedy movies opposed to only 1 on Thrillers.


  • Produce a table with the average gross earning and budget (gross and budget) by genre. Calculate a variable return_on_budget which shows how many $ did a movie make at the box office for each $ of its budget. Rank genres by this return_on_budget in descending order.
avgGenre <- movies%>%
  group_by(genre)%>%
  summarise(mean(gross), mean(budget))%>%
  
  rename("Average_Gross" = "mean(gross)", "Average_Budget" = "mean(budget)")%>%
  
  mutate(return_on_budget = Average_Gross/Average_Budget)%>%
  
  mutate(Average_Gross = dollar(Average_Gross), Average_Budget = dollar(Average_Budget))%>% # The dollar function is from the scales package and allows us to make the numbers more easily readable.
  mutate(return_on_budget = round(return_on_budget, 3))%>% #We round the return column to the 3rd nearest decimal.
  
  arrange(desc(return_on_budget))

avgGenre
## # A tibble: 17 × 4
##    genre       Average_Gross Average_Budget return_on_budget
##    <chr>       <chr>         <chr>                     <dbl>
##  1 Musical     $92,084,000   $3,189,500               28.9  
##  2 Family      $149,160,478  $14,833,333              10.1  
##  3 Western     $20,821,884   $3,465,000                6.01 
##  4 Documentary $17,353,973   $5,887,852                2.95 
##  5 Horror      $37,713,738   $13,504,916               2.79 
##  6 Fantasy     $42,408,841   $17,582,143               2.41 
##  7 Comedy      $42,630,552   $24,446,319               1.74 
##  8 Mystery     $67,533,021   $39,218,750               1.72 
##  9 Animation   $98,433,792   $61,701,429               1.60 
## 10 Biography   $45,201,805   $28,543,696               1.58 
## 11 Adventure   $95,794,257   $66,290,069               1.44 
## 12 Drama       $37,465,371   $26,242,933               1.43 
## 13 Crime       $37,502,397   $26,596,169               1.41 
## 14 Romance     $31,264,848   $25,107,500               1.25 
## 15 Action      $86,583,860   $71,354,888               1.21 
## 16 Sci-Fi      $29,788,371   $27,607,143               1.08 
## 17 Thriller    $2,468        $300,000                  0.008

  • Produce a table that shows the top 15 directors who have created the highest gross revenue in the box office. Don’t just show the total gross amount, but also the mean, median, and standard deviation per director.
topDirectors <- movies%>%
  group_by(director)%>%
  summarise(total_gross = sum(gross), avg_gross = mean(gross), median_gross = median(gross), sd_gross = sd(gross))%>%
  slice_max(order_by = total_gross, n = 15)%>%
  
  mutate(total_gross = dollar(total_gross), avg_gross = dollar(avg_gross), median_gross = dollar(median_gross), sd_gross = dollar(sd_gross))

topDirectors
## # A tibble: 15 × 5
##    director          total_gross    avg_gross    median_gross sd_gross    
##    <chr>             <chr>          <chr>        <chr>        <chr>       
##  1 Steven Spielberg  $4,014,061,704 $174,524,422 $164,435,221 $101,421,051
##  2 Michael Bay       $2,231,242,537 $171,634,041 $138,396,624 $127,161,579
##  3 Tim Burton        $2,071,275,480 $129,454,718 $76,519,172  $108,726,924
##  4 Sam Raimi         $2,014,600,898 $201,460,090 $234,903,076 $162,126,632
##  5 James Cameron     $1,909,725,910 $318,287,652 $175,562,880 $309,171,337
##  6 Christopher Nolan $1,813,227,576 $226,653,447 $196,667,606 $187,224,133
##  7 George Lucas      $1,741,418,480 $348,283,696 $380,262,555 $146,193,880
##  8 Robert Zemeckis   $1,619,309,108 $124,562,239 $100,853,835 $91,300,279 
##  9 Clint Eastwood    $1,378,321,100 $72,543,216  $46,700,000  $75,487,408 
## 10 Francis Lawrence  $1,358,501,971 $271,700,394 $281,666,058 $135,437,020
## 11 Ron Howard        $1,335,988,092 $111,332,341 $101,587,923 $81,933,761 
## 12 Gore Verbinski    $1,329,600,995 $189,942,999 $123,207,194 $154,473,822
## 13 Andrew Adamson    $1,137,446,920 $284,361,730 $279,680,930 $120,895,765
## 14 Shawn Levy        $1,129,750,988 $102,704,635 $85,463,309  $65,484,773 
## 15 Ridley Scott      $1,128,857,598 $80,632,686  $47,775,715  $68,812,285

  • Finally, ratings. Produce a table that describes how ratings are distributed by genre. We don’t want just the mean, but also, min, max, median, SD and some kind of a histogram or density graph that visually shows how ratings are distributed.
ratings <- movies%>%
  group_by(genre)%>%
  summarise(avg_rating = mean(rating), min_rating = min(rating), max_rating = max(rating), median_rating = median(rating), sd_rating = sd(rating))%>%
  arrange(desc(avg_rating))
ratings
## # A tibble: 17 × 6
##    genre       avg_rating min_rating max_rating median_rating sd_rating
##    <chr>            <dbl>      <dbl>      <dbl>         <dbl>     <dbl>
##  1 Biography         7.11        4.5        8.9          7.2      0.760
##  2 Crime             6.92        4.8        9.3          6.9      0.849
##  3 Mystery           6.86        4.6        8.5          6.9      0.882
##  4 Musical           6.75        6.3        7.2          6.75     0.636
##  5 Drama             6.73        2.1        8.8          6.8      0.917
##  6 Documentary       6.66        1.6        8.5          7.4      1.77 
##  7 Sci-Fi            6.66        5          8.2          6.4      1.09 
##  8 Animation         6.65        4.5        8            6.9      0.968
##  9 Romance           6.65        6.2        7.1          6.65     0.636
## 10 Adventure         6.51        2.3        8.6          6.6      1.09 
## 11 Family            6.5         5.7        7.9          5.9      1.22 
## 12 Action            6.23        2.1        9            6.3      1.03 
## 13 Fantasy           6.15        4.3        7.9          6.45     0.959
## 14 Comedy            6.11        1.9        8.8          6.2      1.02 
## 15 Horror            5.83        3.6        8.5          5.9      1.01 
## 16 Western           5.7         4.1        7.3          5.7      2.26 
## 17 Thriller          4.8         4.8        4.8          4.8     NA
overall<-ggplot(movies, aes(x=rating))+
  geom_histogram(color="black", fill="white")+
  geom_vline(aes(xintercept=mean(rating), color = "Average Rating"), linetype = "dashed")+
  labs(title = "Rating distribution", subtitle = "A histogram on overall rating across genres", x = "Rating")+
  scale_color_manual(name = "Legend", values = c("Average Rating" = "orange"))+ #We adjust the legend to display the proper label for the mean line
  theme_bw()
overall

rating_by_genre<-ggplot(movies, aes(x=rating))+
  geom_histogram(color="black", fill="white")+
  facet_wrap(vars(genre), scales = "free_y")+ #we decided that it would be more visually informative, if the y axis adapts from histogram to histogram.
  labs(title = "Rating distribution", subtitle = "Histograms per genre")+
  theme_bw()
rating_by_genre


  • Examine the relationship between gross and cast_facebook_likes. Produce a scatterplot and write one sentence discussing whether the number of facebook likes that the cast has received is likely to be a good predictor of how much money a movie will make at the box office. What variable are you going to map to the Y- and X- axes?
all_scatter_likes <- ggplot(movies, aes(x=cast_facebook_likes, y=gross))+
  geom_point(aes(color=genre))+
  geom_smooth(method = "lm")+
  
  scale_x_continuous(labels = number)+ #we use the scales package to enhance the axes' readability 
  scale_y_continuous(labels = dollar)+
  
  labs(title = "Likes and gross revenue", subtitle = "An undadjusted look at the relationship between casts' facebook popularity and the movie's gross revenue", x = "Casts' likes on facebook", y = "Gross Revenue")+
  theme_bw()

all_scatter_likes

#In the following, we left out the visualization of outlying values in order to make the graph easier to read.
adjusted_scatter_likes <- ggplot(movies,aes(x=cast_facebook_likes, y=gross))+

  geom_point(aes(color=genre))+
  geom_smooth(method = "lm")+

  xlim(0, 150000)+ #we limit the display of values on the x and y axes. This does not exclude the values from the regression line calculation, but only enhances readability.
  ylim(0,500000000)+
  
  labs(title = "Likes and gross revenue", subtitle = "An adjusted look at the relationship between casts' facebook popularity and the movie's gross revenue", x = "Casts' likes on facebook", y = "Gross Revenue")+
  theme_bw()

adjusted_scatter_likes

From the trendline and correlation coefficient of 0.28 one can observe that there is a weak positive correlation between the two variables indicating that the casts’ Facebook popularity may help predict the gross revenue of a movie to a slight extent.


  • Examine the relationship between gross and budget. Produce a scatterplot and write one sentence discussing whether budget is likely to be a good predictor of how much money a movie will make at the box office.
all_scatter_budget <- ggplot(movies, aes(x=budget, y=gross))+
  geom_point(aes(color=genre))+
  geom_smooth(method = "lm")+
  
  scale_x_continuous(labels = dollar)+
  scale_y_continuous(labels = dollar)+

  labs(title = "Budget and gross revenue", subtitle = "A look at the relationship between the movie's budget and its gross revenue", x = "Budget", y = "Gross Revenue")+

  theme_bw()
all_scatter_budget

Compared to the casts’ facebook popularity, budget seems to be a stronger predictor of a movie’s gross revenue as the correlation coefficient is closer to 1.


  • Examine the relationship between gross and rating. Produce a scatterplot, faceted by genre and discuss whether IMDB ratings are likely to be a good predictor of how much money a movie will make at the box office. Is there anything strange in this dataset?
all_scatter_rating <- ggplot(movies, aes(x=rating, y=gross))+
  geom_point(aes(color=genre), size=0.5)+
  ylim(0,750000000)+
  geom_smooth(method = "lm")+
  facet_wrap(vars(genre), scales = "free_y")+ #we decide to only have a changing y axis across the charts to make identifying smaller values easier 
  
  scale_y_continuous(labels = dollar_format(scale = 1/10000000, prefix = "m $"))+ #we use the scales package to scale the y axis to million $
  
  labs(title = "Rating and gross revenue", subtitle = "A look at the relationship between a movie's rating and its gross revenue", x = "Rating", y = "Gross Revenue")+
  
  theme_bw()

all_scatter_rating

Overall, one can observe a positive correlation between a movie’s rating and its gross revenue, however, as with casts’ facebook likes, the correlation is weak. Due to the small sample size of some genres (e.g. Musical, Western and Sci-Fi), not all trendlines are representative.


3 Returns of financial stocks

nyse <- read_csv(here::here("data","nyse.csv"))
  • Based on this dataset, create a table and a bar plot that shows the number of companies per sector, in descending order
# YOUR CODE GOES HERE
sector_df<-nyse%>%
  group_by(sector)%>%
  summarise(number = count(sector))%>%
  arrange(desc(number))
sector_df
## # A tibble: 12 × 2
##    sector                number
##    <chr>                  <int>
##  1 Finance                   97
##  2 Consumer Services         79
##  3 Public Utilities          60
##  4 Capital Goods             45
##  5 Health Care               45
##  6 Energy                    42
##  7 Technology                40
##  8 Basic Industries          39
##  9 Consumer Non-Durables     31
## 10 Miscellaneous             12
## 11 Transportation            10
## 12 Consumer Durables          8
barplot_sector<- ggplot(sector_df, aes(y=fct_reorder(sector, number), x=number))+
  geom_col(fill = "darkblue", width = 0.5)+
  geom_text(aes(label = number), hjust = 3, color = "white")+ #we display the # of companies and use hjust to move the labels to the left.
  labs(title = "Sector differences", subtitle = "The number of companies per sector", y = "Sector", x= "Number of companies")+
  
  theme_bw()

barplot_sector


  • Next, let’s choose some stocks and their ticker symbols and download some data. You MUST choose 6 different stocks from the ones listed below; You should, however, add SPY which is the SP500 ETF (Exchange Traded Fund).
# Notice the cache=TRUE argument in the chunk options. Because getting data is time consuming, 
# cache=TRUE means that once it downloads data, the chunk will not run again next time you knit your Rmd

myStocks <- c("GREK","GME","AMC","AAPL","COW","TSLA","SPY" ) %>%
  tq_get(get  = "stock.prices",
         from = "2011-01-01",
         to   = "2021-08-31") %>%
  group_by(symbol) 

glimpse(myStocks) # examine the structure of the resulting data frame
## Rows: 16,018
## Columns: 8
## Groups: symbol [7]
## $ symbol   <chr> "GREK", "GREK", "GREK", "GREK", "GREK", "GREK", "GREK", "GREK…
## $ date     <date> 2011-12-08, 2011-12-09, 2011-12-12, 2011-12-13, 2011-12-14, …
## $ open     <dbl> 44.8, 44.1, 44.4, 44.8, 41.9, 41.8, 42.0, 40.9, 42.3, 42.1, 4…
## $ high     <dbl> 44.9, 44.7, 44.4, 44.8, 41.9, 42.0, 42.2, 43.4, 42.3, 42.1, 4…
## $ low      <dbl> 44.2, 44.1, 42.5, 42.0, 40.8, 41.8, 39.9, 40.4, 41.4, 42.1, 4…
## $ close    <dbl> 44.3, 44.4, 42.7, 42.3, 41.5, 42.0, 41.2, 41.0, 41.4, 42.1, 4…
## $ volume   <dbl> 1167, 667, 2700, 2633, 200, 1267, 3667, 1467, 1100, 167, 700,…
## $ adjusted <dbl> 38.3, 38.4, 36.9, 36.6, 36.0, 36.3, 35.7, 35.5, 35.8, 36.5, 3…

#calculate daily returns
myStocks_returns_daily <- myStocks %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "daily", 
               type       = "log",
               col_rename = "daily_returns",
               cols = c(nested.col))  

#calculate monthly  returns
myStocks_returns_monthly <- myStocks %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "monthly", 
               type       = "arithmetic",
               col_rename = "monthly_returns",
               cols = c(nested.col)) 

#calculate yearly returns
myStocks_returns_annual <- myStocks %>%
  group_by(symbol) %>%
  tq_transmute(select     = adjusted, 
               mutate_fun = periodReturn, 
               period     = "yearly", 
               type       = "arithmetic",
               col_rename = "yearly_returns",
               cols = c(nested.col))
  • Create a table where you summarise monthly returns for each of the stocks and SPY; min, max, median, mean, SD.
# YOUR CODE GOES HERE
summary_monthly<-myStocks_returns_monthly%>%
  group_by(symbol)%>%
  summarise(min_return = min(monthly_returns), max_return = max(monthly_returns), median_return = median(monthly_returns), avg_return = mean(monthly_returns), sd_return = sd(monthly_returns))
summary_monthly
## # A tibble: 7 × 6
##   symbol min_return max_return median_return avg_return sd_return
##   <chr>       <dbl>      <dbl>         <dbl>      <dbl>     <dbl>
## 1 AAPL       -0.181     0.217        0.0257     0.0245     0.0785
## 2 AMC        -0.504     5.25         0.00679    0.0798     0.611 
## 3 COW        -0.134     0.0762       0.00264   -0.00524    0.0485
## 4 GME        -0.687    16.3         -0.0118     0.142      1.45  
## 5 GREK       -0.332     0.369        0.00609    0.00365    0.110 
## 6 SPY        -0.125     0.127        0.0174     0.0123     0.0381
## 7 TSLA       -0.224     0.811        0.0148     0.0523     0.176

  • Plot a density plot, using geom_density(), for each of the stocks
# YOUR CODE GOES HERE
density<-ggplot(myStocks_returns_monthly, aes(x=monthly_returns))+
  geom_density()+
  facet_wrap(vars(symbol), scales="free")+ #we use free scales per density plot in order to enhance readability.
  
  labs(title = "Return Distribution", subtitle = "Density plots on monthly returns", x = "Monthly return", y = "Percentage")

density

  • What can you infer from this plot? Which stock is the riskiest? The least risky?

From the density plots, one can observe that volatility in monthly return varies widely across the stocks/ETF. For instance, the returns of COW seem to be more evenly distributed than GME, indicating a lower risk when investing into them. In our eyes, GME seems to be the riskiest stock, while an investment into GREK carries the least risk. The latter is understandable as the GREK ETF is diversified and holds shares of the largest and most liquid companies in Greece (source).

Looking at the x axis, one can also observe a large difference in the range of monthly returns across the instruments. Interestingly, most of the distribution lies around the 0 mark for all displayed tickers.


  • Finally, make a plot that shows the expected monthly return (mean) of a stock on the Y axis and the risk (standard deviation) in the X-axis. Please use ggrepel::geom_text_repel() to label each stock
# YOUR CODE GOES HERE
scatter_return_volatility <- ggplot(summary_monthly, aes(x=sd_return, y=avg_return, label=symbol))+
  geom_point()+
  geom_smooth(method = "lm", se = F)+ #in this case, we do not want a confidence interval around the regression line
  geom_text_repel()+
  
  labs(title = "Risk vs. Reward", subtitle = "A look at the relationship between the instruments' standard deviation and average monthly return", x = "Standard deviation", y = "Average monthly return")+
  
  theme_bw()

scatter_return_volatility

  • What can you infer from this plot? Are there any stocks which, while being riskier, do not have a higher expected return?

Generally, the market seems to reward higher risk with higher return. However, there are minor exceptions in our sample: GREK, indexing the Greek market, has a higher SD than SPY, yet its average monthly return is lower. A good approximation for comparing risk and reward across instruments is the blue linear regression line. Tickers above this line indicate a greater reward for a given risk compared to tickers below that line


4 On your own: IBM HR Analytics

hr_dataset <- read_csv(here::here("data", "datasets_1067_1925_WA_Fn-UseC_-HR-Employee-Attrition.csv"))
glimpse(hr_dataset)
## Rows: 1,470
## Columns: 35
## $ Age                      <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2…
## $ Attrition                <chr> "Yes", "No", "Yes", "No", "No", "No", "No", "…
## $ BusinessTravel           <chr> "Travel_Rarely", "Travel_Frequently", "Travel…
## $ DailyRate                <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358,…
## $ Department               <chr> "Sales", "Research & Development", "Research …
## $ DistanceFromHome         <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, …
## $ Education                <dbl> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, …
## $ EducationField           <chr> "Life Sciences", "Life Sciences", "Other", "L…
## $ EmployeeCount            <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ EmployeeNumber           <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16,…
## $ EnvironmentSatisfaction  <dbl> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3, …
## $ Gender                   <chr> "Female", "Male", "Male", "Female", "Male", "…
## $ HourlyRate               <dbl> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 4…
## $ JobInvolvement           <dbl> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, …
## $ JobLevel                 <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, …
## $ JobRole                  <chr> "Sales Executive", "Research Scientist", "Lab…
## $ JobSatisfaction          <dbl> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3, …
## $ MaritalStatus            <chr> "Single", "Married", "Single", "Married", "Ma…
## $ MonthlyIncome            <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 269…
## $ MonthlyRate              <dbl> 19479, 24907, 2396, 23159, 16632, 11864, 9964…
## $ NumCompaniesWorked       <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, …
## $ Over18                   <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", …
## $ OverTime                 <chr> "Yes", "No", "Yes", "Yes", "No", "No", "Yes",…
## $ PercentSalaryHike        <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 1…
## $ PerformanceRating        <dbl> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, …
## $ RelationshipSatisfaction <dbl> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2, …
## $ StandardHours            <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 8…
## $ StockOptionLevel         <dbl> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, …
## $ TotalWorkingYears        <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3…
## $ TrainingTimesLastYear    <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, …
## $ WorkLifeBalance          <dbl> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, …
## $ YearsAtCompany           <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4,…
## $ YearsInCurrentRole       <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, …
## $ YearsSinceLastPromotion  <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, …
## $ YearsWithCurrManager     <dbl> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, …
head(hr_cleaned)
## # A tibble: 6 × 19
##     age attrition daily_rate department      distance_from_ho… education  gender
##   <dbl> <chr>          <dbl> <chr>                       <dbl> <chr>      <chr> 
## 1    41 Yes             1102 Sales                           1 College    Female
## 2    49 No               279 Research & Dev…                 8 Below Col… Male  
## 3    37 Yes             1373 Research & Dev…                 2 College    Male  
## 4    33 No              1392 Research & Dev…                 3 Master     Female
## 5    27 No               591 Research & Dev…                 2 Below Col… Male  
## 6    32 No              1005 Research & Dev…                 2 College    Male  
## # … with 12 more variables: job_role <chr>, environment_satisfaction <chr>,
## #   job_satisfaction <chr>, marital_status <chr>, monthly_income <dbl>,
## #   num_companies_worked <dbl>, percent_salary_hike <dbl>,
## #   performance_rating <chr>, total_working_years <dbl>,
## #   work_life_balance <chr>, years_at_company <dbl>,
## #   years_since_last_promotion <dbl>
skim(hr_cleaned)
Data summary
Name hr_cleaned
Number of rows 1470
Number of columns 19
_______________________
Column type frequency:
character 10
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
attrition 0 1 2 3 0 2 0
department 0 1 5 22 0 3 0
education 0 1 6 13 0 5 0
gender 0 1 4 6 0 2 0
job_role 0 1 7 25 0 9 0
environment_satisfaction 0 1 3 9 0 4 0
job_satisfaction 0 1 3 9 0 4 0
marital_status 0 1 6 8 0 3 0
performance_rating 0 1 9 11 0 2 0
work_life_balance 0 1 3 6 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1 36.92 9.14 18 30 36 43 60 ▂▇▇▃▂
daily_rate 0 1 802.49 403.51 102 465 802 1157 1499 ▇▇▇▇▇
distance_from_home 0 1 9.19 8.11 1 2 7 14 29 ▇▅▂▂▂
monthly_income 0 1 6502.93 4707.96 1009 2911 4919 8379 19999 ▇▅▂▁▂
num_companies_worked 0 1 2.69 2.50 0 1 2 4 9 ▇▃▂▂▁
percent_salary_hike 0 1 15.21 3.66 11 12 14 18 25 ▇▅▃▂▁
total_working_years 0 1 11.28 7.78 0 6 10 15 40 ▇▇▂▁▁
years_at_company 0 1 7.01 6.13 0 3 5 9 40 ▇▂▁▁▁
years_since_last_promotion 0 1 2.19 3.22 0 0 1 3 15 ▇▁▁▁▁

The hr_cleaned dataset holds fictional information of a total of 1470 employees on variables such as the employees’ education, job specifics, performance and attrition. All in all, there are 9 numeric and 10 non-numeric variables

Given the wide range of information, one can try to make inferences about dependencies among variables. For example, it might be interesting to find out if a combination of variables might be a good predictor for an employee’s decision to leave the company (attrition = Yes).

  • How often do people leave the company (attrition)
number_attrition<-hr_cleaned%>%
  group_by(attrition)%>%
  summarize(number_count = count(attrition))%>%
  mutate(perc = number_count/sum(number_count)) #we introduce a second column, namely percent, to represent the count as a percentage of the total

number_attrition
## # A tibble: 2 × 3
##   attrition number_count  perc
##   <chr>            <int> <dbl>
## 1 No                1233 0.839
## 2 Yes                237 0.161

Looking at the sheer counts of YES and NO entries for the attrition coloumn, around 237 of all employees have left the company, while 1233 have been staying employed. This corresponds to around 16% of all recorded employees.

  • How are age, years_at_company, monthly_income and years_since_last_promotion distributed? can you roughly guess which of these variables is closer to Normal just by looking at summary statistics?
n<-length(hr_cleaned$age) 

pic_data<-data.frame(cbind(c(hr_cleaned$age,hr_cleaned$monthly_income, 
                  hr_cleaned$years_since_last_promotion), 
                c(rep("age",n),rep("monthly_income",n),rep("years_since_last_promotion",n)))) 
colnames(pic_data)<-c("Num","Attribute") 
pic_data$Num<-as.integer(pic_data$Num) 

distribution<-ggplot(pic_data, aes(x=Num))+ 
  geom_histogram()+ 
  facet_wrap(vars(Attribute), scale="free")+ 
  labs(title = "Distribution across the dataset", subtitle = "Counts of values for age, monthly income and years since last promotion", x="",y="Frequency")+ 
  
  theme_bw() 
distribution 

  • How are job_satisfaction and work_life_balance distributed? Don’t just report counts, but express categories as % of total
hr_satis <- hr_cleaned %>% 
  group_by(job_satisfaction) %>% 
  count(sort=TRUE) %>% 
  mutate(percent=n/1470)

hr_satis 
## # A tibble: 4 × 3
## # Groups:   job_satisfaction [4]
##   job_satisfaction     n percent
##   <chr>            <int>   <dbl>
## 1 Very High          459   0.312
## 2 High               442   0.301
## 3 Low                289   0.197
## 4 Medium             280   0.190
work_life_satis <- hr_cleaned %>% 
  group_by(work_life_balance) %>% 
  count(sort=TRUE) %>% 
  mutate(percent=n/1470) 
 
work_life_satis 
## # A tibble: 4 × 3
## # Groups:   work_life_balance [4]
##   work_life_balance     n percent
##   <chr>             <int>   <dbl>
## 1 Better              893  0.607 
## 2 Good                344  0.234 
## 3 Best                153  0.104 
## 4 Bad                  80  0.0544
  • Is there any relationship between monthly_income and education? monthly_income and gender?
plot1 <- ggplot(hr_cleaned, aes(x = fct_reorder(education, -monthly_income), y=monthly_income))+ #we include a negative "-" sign infront of monthly_income in the fct_reorder function, in order to sort the x axis based on decreasing y values
  geom_boxplot()+
  
  scale_y_continuous(labels = dollar)+
  
  labs(title = "Education vs income", subtitle = "A look at an employee's academic background and monthly income", x = "Education", y = "Monthly income")+
  
  theme_bw()
plot1 

plot2 <-ggplot(hr_cleaned, aes(x = fct_reorder(gender, -monthly_income), y=monthly_income))+ 
  geom_boxplot()+
  
  scale_y_continuous(label = dollar)+
  
  labs(title = "Gender vs income", subtitle = "A look at an employee's sex and monthly income", x = "Gender", y = "Monthly income")+
  
  theme_bw()
plot2 

  • Plot a boxplot of income vs job role. Make sure the highest-paid job roles appear first
income_vs_job <- ggplot(hr_cleaned, aes(x=fct_reorder(job_role, -monthly_income), y=monthly_income)) + 
  geom_boxplot()+
  
  scale_y_continuous(labels = dollar)+
  
  labs(title = "Income distribution by job", subtitle = "A look at the relationship between job titles and monthly income", x = "Job" , y="Monthly income")+
  
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) #we adjust the labels on the x axis such that they are vertically aligned

  

income_vs_job

  • Calculate and plot a bar chart of the mean (or median?) income by education level.
#We opted for plotting the MEDIAN monthly income, as it is typically less prone to outliers

median_income <- hr_cleaned%>%
  group_by(education)%>%
  summarise(median_monthly_income = median(monthly_income))%>%
  
  ggplot(aes(x = fct_reorder(education, -median_monthly_income), y = median_monthly_income))+
    geom_col(fill = "lightblue")+
  
    scale_y_continuous(labels = dollar)+
  
    labs(title = "Median income by education", subtitle = "A look at the relationship between academic background and median monthly income", x = "Education" , y="Median monthly income")+
  
    theme_bw()

median_income

  • Plot the distribution of income by education level. Use a facet_wrap and a theme from ggthemes
plot7 <- hr_cleaned %>%  
  group_by(education) %>%  
  ggplot(aes(x=monthly_income))+ 
    geom_histogram(fill = "lightblue")+
    facet_wrap(vars(education), scales = "free_y")+
  
    scale_x_continuous(labels = dollar)+
  
    labs(title = "Income distribution by education", subtitle = "A look at the relationship between edcuation level and monthly income", x = "Monthly income" , y="Count")+
  
    ggthemes::theme_stata()+
    theme(plot.background = element_blank()) #we remove the blue background to make the desing more coherent
plot7 

  • Plot income vs age, faceted by job_role
plot8 <- ggplot(hr_cleaned, aes(x=age, y=monthly_income))+ 
  geom_point(size=0.1)+ 
  geom_smooth(method = "lm")+
  facet_wrap(vars(job_role), scales = "free")+
  
  scale_y_continuous(labels = dollar)+
  
  labs(title = "Income by age and job title", subtitle = "A look at the relationship between age, job title and monthly income level", x = "Age" , y="Monthly income")+
  
  ggthemes::theme_stata()+
  theme(plot.background = element_blank())

plot8 


5 Challenge 1: Replicating a chart

We would like to replicate the following chart.

First we do some cleaning. As the vaccination data is time series, but we are only interested in the most recent status of vacc. percent per county, we can limit our dataframe to Sep. 3rd data.

In addition, we add a column to the election dataset which calculates the number of votes for one candidate as a percentage of the total number of votes in every given county.

#as there are multiple dates for vacc statuses, and we are only interested in the most recent one, we can do some cleaning. Data from Sep. 3rd, 2021, seems to be most recent.

vaccinations_recent <- vaccinations %>%
  filter(date=="09/03/2021")

#we can also include an additional coloumn to the election dataset, indicating the number of votes for one candidate as a percentage of whole

elections_perc <- election2020_results%>%
  mutate(perc_votes = candidatevotes/totalvotes)

Next up, we create a new dataframe, based on joined values from the three original ones. Here, we only include the data really necessary for the reproduction of the graph.

Let’s have a sneak peak at the output dataframe. It looks like this is all the data we really need.

#We decide to just join coloumn series_complete_pop_pct from the vacc dataset, as this should suffice to recreate the graph. We use inner join here, as we have a key variable, i.e. fips, to join all three datasets by.

#We also rename the coloumns to make their names a little more easy on the tongue.

joined_frame <- inner_join(population, vaccinations_recent[,c("fips", "series_complete_pop_pct") ], by = "fips")%>%
  rename("vacc_perc" = "series_complete_pop_pct", "pop" = "pop_estimate_2019")

#Similarly, we join parts of the cleaned election data, and filter out records other than votes for Trump, as they are not needed for the graph. In addition, we sum up perc_votes because there is sometimes data for different ways of voting, i.e. early vote, election day, provisional.

joined_frame <- inner_join(joined_frame, elections_perc[,c("fips", "candidate", "perc_votes")])%>%
  filter(candidate=="DONALD J TRUMP")%>%
  group_by(fips, pop, vacc_perc, candidate, )%>%
  summarise(sum(perc_votes))%>%
  rename("perc_votes_total" = "sum(perc_votes)") #sum, to take into consideration all kinds of voting methods.

head(joined_frame) #we can have a look at the dataset. It looks like we have all the data we need for visualization
## # A tibble: 6 × 5
## # Groups:   fips, pop, vacc_perc [6]
##   fips     pop vacc_perc candidate      perc_votes_total
##   <chr>  <dbl>     <dbl> <chr>                     <dbl>
## 1 01001  55869      30.5 DONALD J TRUMP            0.714
## 2 01003 223234      37.9 DONALD J TRUMP            0.762
## 3 01005  24686      31.3 DONALD J TRUMP            0.535
## 4 01007  22394      26.6 DONALD J TRUMP            0.784
## 5 01009  57826      22.6 DONALD J TRUMP            0.896
## 6 01011  10101      38.4 DONALD J TRUMP            0.248

Now that we have created the dataframe with all necessary data, let’s get to the attractive part of this exercise: making a beautiful plot.

plot_counties <- ggplot(joined_frame, aes(x=perc_votes_total, vacc_perc))+
  geom_rect(aes(xmin = 0, xmax = 0.45, ymin = 0, ymax = 100), fill = "#A6A8FB")+ #we add three boxes to the plot that have HEX values for the blue, purple and red colours respectively.
  geom_rect(aes(xmin = 0.45, xmax = 0.55, ymin = 0, ymax = 100), fill = "#D2A6D2")+
  geom_rect(aes(xmin = 0.55, xmax = 1, ymin = 0, ymax = 100), fill = "#FDA7A7")+
  
  geom_point(aes(size=pop), alpha = 1/3)+
  geom_smooth(method = "lm")+
  
  labs(title = "COVID19 vaccination levels versus Trump voting", subtitle = "A look at the relationship between 2020 US election results and percentages of fully vaccinated people, per county as of Sep. 3rd, 2021", x = "Trump voting", y = "County's percentage of fully vaccinated people")+
  
  theme_bw()

plot_counties

It seems like there are a number of outliers in the dataset. 276 counties have reported a full vaccination count of 0 for the Sep. 3rd collection. For instance, as you can see in the following output dataframe, Bexar county, home to a population of 2003554, reported a vaccination rate of 0%.

filter(joined_frame[ ,c("fips", "pop", "vacc_perc")], fips==48029)
## # A tibble: 1 × 3
## # Groups:   fips, pop, vacc_perc [1]
##   fips      pop vacc_perc
##   <chr>   <dbl>     <dbl>
## 1 48029 2003554         0

It might be appropriate, to exclude records with vacc_perc = 0 from out dataset, as it seems very hard to believe that in a county like Bexar, home to 2003554 people, not a single one is fully vaccinated. A data collection error like this potentially has a large impact on the regression line. We therefore create an adjusted DF.

#we first exclude records with vacc_perc equal to 0
adj_joined_fram <- joined_frame%>%
  filter(vacc_perc!=0)

#here, we simply replicate the plot, but this time with adjusted values.
adj_plot_counties <- ggplot(adj_joined_fram, aes(x = perc_votes_total, y = vacc_perc))+
  geom_rect(aes(xmin = 0, xmax = 0.45, ymin = 0, ymax = 100), fill = "#A6A8FB")+
  geom_rect(aes(xmin = 0.45, xmax = 0.55, ymin = 0, ymax = 100), fill = "#D2A6D2")+
  geom_rect(aes(xmin = 0.55, xmax = 1, ymin = 0, ymax = 100), fill = "#FDA7A7")+
  
  geom_point(aes(size = pop), alpha = 1/3)+
  geom_smooth(method = "lm")+
  
  labs(title = "ADJUSTED - COVID19 vaccination levels versus Trump voting", subtitle = "An adjusted look at the relationship between 2020 US election results and percentages of fully vaccinated people, per county as of Sep. 3rd, 2021", x = "Trump voting", y = "County's percentage of fully vaccinated people")+

  theme_bw()
adj_plot_counties

6 Challenge 2: Opinion polls for the 2021 German elections

url <- "https://en.wikipedia.org/wiki/Opinion_polling_for_the_2021_German_federal_election"
# https://www.economist.com/graphic-detail/who-will-succeed-angela-merkel
# https://www.theguardian.com/world/2021/jun/21/german-election-poll-tracker-who-will-be-the-next-chancellor


# get tables that exist on wikipedia page 
tables <- url %>% 
  read_html() %>% 
  html_nodes(css="table")


# parse HTML tables into a dataframe called polls 
# Use purr::map() to create a list of all tables in URL
polls <- map(tables, . %>% 
             html_table(fill=TRUE)%>% 
             janitor::clean_names())


# list of opinion polls
german_election_polls <- polls[[1]] %>% # the first table on the page contains the list of all opinions polls
  slice(2:(n()-1)) %>%  # drop the first row, as it contains again the variable names and last row that contains 2017 results
  mutate(
         # polls are shown to run from-to, e.g. 9-13 Aug 2021. We keep the last date, 13 Aug here, as the poll date
         # and we extract it by picking the last 11 characters from that field
         end_date = str_sub(fieldwork_date, -11),
         
         # end_date is still a string, so we convert it into a date object using lubridate::dmy()
         end_date = dmy(end_date),
         
         # we also get the month and week number from the date, if we want to do analysis by month- week, etc.
         month = month(end_date),
         week = isoweek(end_date)
         )

Now that we have fetched all the data from the wikipedia table into our german_election_polls data frame, let’s try to visualize our data.

plot_election_polls <- ggplot(german_election_polls, aes(x = end_date))+ #because we have multiple lines (representing multiple parties), we only map the x axis uptop here. We map y-values when we create every scatter and line plot separately for each party.
  
  geom_point(aes(y = union), color = "black", alpha = 0.3)+ 
  geom_line(aes(y=rollmean(union, 14, na.pad=TRUE), color = "black"))+ #we add a line representing the rolling average of values for the past 14 days. We also adjust the padding for enhanced visualization.+
  
  geom_point(aes(y = spd), color = "red", alpha = 0.3)+
  geom_line(aes(y=rollmean(spd, 14, na.pad=TRUE), color = "red"))+
  
  geom_point(aes(y = grune), color = "darkgreen", alpha = 0.3)+
  geom_line(aes(y=rollmean(grune, 14, na.pad=TRUE), color = "darkgreen"))+
  
  geom_point(aes(y = af_d), color = "blue", alpha = 0.3)+
  geom_line(aes(y=rollmean(af_d, 14, na.pad=TRUE), color = "blue"))+
  
  geom_point(aes(y = fdp), color = "#F4E332", alpha = 0.3)+
  geom_line(aes(y=rollmean(fdp, 14, na.pad=TRUE), color = "#F4E332"))+
  
  geom_point(aes(y=linke), color = "darkred", alpha = 0.3)+
  geom_line(aes(y=rollmean(linke, 14, na.pad = TRUE), color = "darkred"))+
  
  
  #we have desperately tried to add a functioning legend to our plot. We finally managed to do so with the scale color identity function, after mapping the correct colors to the variable aesthetics before. Inspiration for this code came from https://aosmith.rbind.io/2018/07/19/manual-legends-ggplot2/ 
  scale_color_identity(breaks = c("black", "red", "darkgreen", "blue", "#F4E332", "darkred"),
                          labels = c("CDU/CSU", "SPD", "Grüne", "AfD", "FDP", "Linke"),
                          guide = "legend",
                          name = "Legend")+
  
  labs(title = "German elections", subtitle = "Change in estimated voting during 2021", x="2021 Poll Month", y="Voting Percentage")+
  
  theme_bw()
 
plot_election_polls

7 Details

  • Who did you collaborate with: Group 3 (Jiaying Chen, Yingjin He, Sabrina Seow, Kashish Solanki, Roman Vazquez Lorenzo, Maximilian Stock)
  • Approximately how much time did you spend on this problem set: 10 hours
  • What, if anything, gave you the most trouble: Finding a way to add the legend to the election poll graph was more complicated than initally expected.

8 Rubric

Check minus (1/5): Displays minimal effort. Doesn’t complete all components. Code is poorly written and not documented. Uses the same type of plot for each graph, or doesn’t use plots appropriate for the variables being analyzed.

Check (3/5): Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output).

Check plus (5/5): Finished all components of the assignment correctly and addressed both challenges. Code is well-documented (both self-documented and with additional comments as necessary). Used tidyverse, instead of base R. Graphs and tables are properly labelled. Analysis is clear and easy to follow, either because graphs are labeled clearly or you’ve written additional text to describe how you interpret the output.